[VL] Add lazy per-column deserialization for Columnar Table Cache#12211
jackylee-ch wants to merge 1 commit into
@yaooqinn PTAL

Thanks @jackylee-ch, V3 layout is a sensible extension of the cache-stats wire we landed in #12092 / #12196. Several things to discuss before this lands:

1. Benchmark needs to be re-run. The checked-in
2. Do we really need a new SQLConf? V3 functionally supersedes V2 (V3 frames also carry
3. Cross-language test parity vs #12196. V3 has no cpp-side byte-equal golden test; JVM-side tests synthesize their own frames via
4. Smaller items.

Happy to file any of these as separate issues if it helps.
Write V3 per-column cache bytes by default for Velox table cache.

Partition stats now only controls the optional stats/pruning payload: stats off writes a no-stats V3 frame, stats on writes V3 with stats, and older native libraries still fall back to V2 stats or legacy bytes.

Add the V3 no-stats JNI/native serializer, JVM parsing for statsLen=0, cross-language golden coverage, and GitHub Actions benchmark execution without committing local benchmark results.

Change-Id: I2a8582f901fafd436cac1a1d16e0367e9330b336
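
The format selection described in the commit message can be read as a small decision table. The Java sketch below is only an illustration of that reading, not code from this PR: the class `CacheFormatChoiceSketch`, the enum `Format`, and the flag `nativeSupportsV3` are hypothetical names, and mapping "stats on" to V2 stats and "stats off" to legacy bytes for older native libraries is an assumption based on the wording above.

```java
// Hypothetical sketch of the write-format choice; names are not from Gluten.
final class CacheFormatChoiceSketch {

  enum Format { V3_WITH_STATS, V3_NO_STATS, V2_WITH_STATS, LEGACY }

  // nativeSupportsV3: assumed capability flag for the loaded native library.
  // statsEnabled: stands in for the partitionStats.enabled setting.
  static Format choose(boolean nativeSupportsV3, boolean statsEnabled) {
    if (nativeSupportsV3) {
      // V3 is the default; stats only decide whether the stats payload is present.
      return statsEnabled ? Format.V3_WITH_STATS : Format.V3_NO_STATS;
    }
    // Assumed fallback for older native libraries.
    return statsEnabled ? Format.V2_WITH_STATS : Format.LEGACY;
  }
}
```

Under this reading, turning stats off never disables lazy per-column framing; it only drops the optional stats payload.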
What changes
This PR makes Velox table cache write V3 per-column framed bytes by default. Lazy materialization is a base table-cache capability; `spark.gluten.sql.columnar.tableCache.partitionStats.enabled` now only controls the optional stats/pruning payload.

- `spark.gluten.sql.columnar.tableCache.lazy.deserialization.enabled`.
- `statsLen=0`) for the default lazy path.

Performance
Four-environment comparison: eager `V2` vs lazy `V3`, each without and with the optional partition-stats payload (`ColumnarTableCacheLazyDeserBenchmark`):

- `V2 without stats` = legacy raw Presto payload (eager full-batch decode, no pruning).
- `V2 with stats` = `framedSerializeWithStats` (eager full-batch decode + partition-stats pruning).
- `V3 without stats` = per-column lazy payload (default; lazy projected decode).
- `V3 with stats` = per-column lazy payload + partition-stats pruning.

100M rows / 32 partitions / 16 columns / 3 iterations, Apple M5 Pro, JDK 8 runtime, real Gluten (off-heap enabled, `ColumnarCachedBatchSerializer`). Read phases build one mode's cache at a time so the full 100M fits. Times are avg ms, lower is better; `relative` is vs `V2 without stats`.

Cache footprint (storage memory)
Footprint is identical across all four modes — V3 per-column framing does not regress cache size
for flat data, and the stats payload is negligible.
Read latency (avg ms / relative speedup vs V2 no-stats)

- `sum(c0)` 1 of 16 columns and 3.5x faster reading 4 of 16, versus eager V2 which decodes all 16.
- additionally lazy-decodes only the surviving batches' projected columns, giving the best result at 136x (`V3 with stats`). Lazy column-skip alone (`V3 no-stats`) is 6.8x.
- on par with / slightly faster than V2 (V3 ~1.3x at 2M), confirming `LazyVector` adds no overhead when every column is materialized. It is omitted from the 100M table because the eager-V2 path decodes the full 100M x 16 off-heap and does not fit this 64 GiB laptop.
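
The column-skip effect described above can be illustrated with a minimal Java sketch. It is not Gluten's V3 wire format and not Velox's `LazyVector`: the frame layout (a column count, one byte length per column, then the column payloads) and the names `LazyColumnFrameSketch`, `encode`, and `LazyBatch` are hypothetical, chosen only to show why per-column framing lets a reader decode 1 of 16 columns without touching the other 15.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: NOT the actual V3 layout used by Gluten.
final class LazyColumnFrameSketch {

  // Assumed frame: [int numColumns][int byteLen per column][column payloads].
  // Each column payload is a run of 8-byte longs.
  static byte[] encode(long[][] columns) {
    int size = 4 + 4 * columns.length;
    for (long[] col : columns) {
      size += 8 * col.length;
    }
    ByteBuffer buf = ByteBuffer.allocate(size);
    buf.putInt(columns.length);
    for (long[] col : columns) {
      buf.putInt(8 * col.length);
    }
    for (long[] col : columns) {
      for (long v : col) {
        buf.putLong(v);
      }
    }
    return buf.array();
  }

  // Reads only the header eagerly; column payloads are decoded on first access.
  static final class LazyBatch {
    private final ByteBuffer frame;
    private final int[] offsets;
    private final int[] lengths;
    private final Map<Integer, long[]> decoded = new HashMap<>();

    LazyBatch(byte[] bytes) {
      frame = ByteBuffer.wrap(bytes);
      int n = frame.getInt(0);
      offsets = new int[n];
      lengths = new int[n];
      int pos = 4 + 4 * n; // first payload byte after the header
      for (int i = 0; i < n; i++) {
        lengths[i] = frame.getInt(4 + 4 * i);
        offsets[i] = pos;
        pos += lengths[i];
      }
    }

    // Decodes column i at most once and caches the result.
    long[] column(int i) {
      return decoded.computeIfAbsent(i, this::decode);
    }

    int decodedColumnCount() {
      return decoded.size();
    }

    private long[] decode(int i) {
      long[] out = new long[lengths[i] / 8];
      for (int r = 0; r < out.length; r++) {
        out[r] = frame.getLong(offsets[i] + 8 * r);
      }
      return out;
    }
  }

  static long sum(long[] values) {
    long total = 0;
    for (long v : values) {
      total += v;
    }
    return total;
  }
}
```

A query shaped like `sum(c0)` would call `sum(batch.column(0))` on each batch, leaving `decodedColumnCount()` at 1 regardless of how many columns the frame carries. An eager reader in this sketch would instead call `column(i)` for every `i` before evaluating anything.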
Net: V3 lazy per-column is a large win on projected/filtered reads (the common table-cache access
pattern) with identical cache footprint and no full-scan regression.
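
The pruning half of the `V3 with stats` result can be sketched the same way. The example below assumes the stats payload carries a per-batch minimum and maximum for the filtered column; that is an assumption for illustration, as are the names `StatsPruningSketch`, `CachedBatch`, and `sumGreaterThan`. Batches whose maximum cannot satisfy the predicate are skipped before any decode happens.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of stats-based batch pruning; not Gluten's implementation.
final class StatsPruningSketch {

  static final class CachedBatch {
    final long min; // assumed stat: lower bound of the column in this batch
    final long max; // assumed stat: upper bound of the column in this batch
    private final long[] payload; // stand-in for the encoded column bytes
    int decodeCount = 0; // counts how often this batch was actually decoded

    CachedBatch(long... values) {
      payload = values.clone();
      min = Arrays.stream(values).min().orElse(Long.MAX_VALUE);
      max = Arrays.stream(values).max().orElse(Long.MIN_VALUE);
    }

    long[] decode() {
      decodeCount++;
      return payload;
    }
  }

  // Sums all values greater than lowerBound, consulting stats before decoding.
  static long sumGreaterThan(List<CachedBatch> batches, long lowerBound) {
    long total = 0;
    for (CachedBatch batch : batches) {
      if (batch.max <= lowerBound) {
        continue; // no row can match, so the batch is never decoded
      }
      for (long v : batch.decode()) {
        if (v > lowerBound) {
          total += v;
        }
      }
    }
    return total;
  }
}
```

With batches holding `1, 2, 3`, `10, 20, 30`, and `4, 5, 6`, a filter of `> 9` decodes only the middle batch. Combining this with per-column lazy decode means the surviving batch would also decode only its projected columns.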
A GitHub Actions run on a larger-RAM runner can reproduce the same 100M comparison via the `Velox Backend (x86)` `workflow_dispatch` benchmark job.

How was this patch tested?

- `./dev/format-scala-code.sh`
- `PATH="/opt/homebrew/opt/llvm@15/bin:$PATH" ./dev/format-cpp-code.sh`
- `git diff --check upstream/main..HEAD`
- `ruby -e 'require "yaml"; YAML.load_file(".github/workflows/velox_backend_x86.yml"); puts "yaml ok"'`
- `./.github/workflows/util/check.sh upstream/main`
- `env CCACHE_DIR=/private/tmp/gluten-ccache ninja -C cpp/build velox/tests/CMakeFiles/velox_operators_test.dir/VeloxColumnarBatchSerializerTest.cc.o`
- `./build/mvn install -pl backends-velox -am -Pspark-3.5 -Pscala-2.12 -Pbackends-velox -DskipTests -Dexec.skip`
- `ColumnarTableCacheLazyDeserBenchmark` with `1000` rows, `4` partitions, `1` iteration, phases `build,read1,read4,readAll,filter`.

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Codex GPT-5